AITopics

Country: Asia > Middle East (0.46)

Genre: Research Report (0.46)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

Neural Information Processing Systems - Feb-10-2026, 21:38:47 GMT

Appendix

We evaluated all models on three additional tasks, beyond those presented in the main paper. Point-of-no-return (PNR) temporal localization error: Given a video clip of a state change, the network has to estimate the time at which a state change begins. More specifically, the model tries to estimate the keyframe within the video clip that contains the point-of-no-return (the time when the state change begins). The occurrence of state change is then predicted by training a binary linear classifier, using the concatenated representations as input. Action Recognition (AR) w/ audio: For this task, video embeddings from f_V and audio embeddings from f_A are concatenated together and passed through two separate linear classifiers to classify the 'verb' and 'noun' of the action occurring in the video clip.
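As a rough illustration of the AR w/ audio probe described above, the sketch below concatenates a video embedding and an audio embedding and feeds the result to two separate linear heads. The embedding sizes, the weights, and the helper names (`linear`, `classify_action`, `pnr_localization_error`) are hypothetical and not taken from the paper; measuring the PNR error as an absolute time difference in seconds is likewise an assumption.

```python
def linear(weights, bias, x):
    """Apply a linear layer: one score per output class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def argmax(scores):
    """Index of the highest score (first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def classify_action(video_emb, audio_emb, verb_head, noun_head):
    """Concatenate both embeddings and classify verb and noun separately.

    Each head is a (weights, bias) pair; the two heads share the input
    but have no parameters in common.
    """
    fused = list(video_emb) + list(audio_emb)  # concatenation of f_V and f_A outputs
    verb = argmax(linear(verb_head[0], verb_head[1], fused))
    noun = argmax(linear(noun_head[0], noun_head[1], fused))
    return verb, noun


def pnr_localization_error(pred_frame, true_frame, fps):
    """Absolute temporal error in seconds (assumed definition)."""
    return abs(pred_frame - true_frame) / fps


if __name__ == "__main__":
    video_emb, audio_emb = [1.0, 0.0], [0.0, 1.0]
    verb_head = ([[0, 0, 0, 0], [1, 0, 0, 1]], [0.0, 0.0])       # 2 verb classes
    noun_head = ([[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]], [0.0] * 3)  # 3 noun classes
    print(classify_action(video_emb, audio_emb, verb_head, noun_head))
    print(pnr_localization_error(pred_frame=12, true_frame=15, fps=30))
```

In a real evaluation the heads would be trained on frozen features; here the weights are set by hand only to make the data flow visible.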

artificial intelligence, machine learning, state change, (15 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.57)

Neural Information Processing Systems - Feb-10-2026, 21:38:43 GMT

Learning State-Aware Visual Representations from Audible Interactions

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multimodal learning frameworks encourage representation invariance over time.

artificial intelligence, machine learning, representation, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.34)

Neural Information Processing Systems - Feb-8-2026, 20:42:07 GMT

CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency.

machine learning, natural language, translation, (17 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Michigan (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.46)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)

arXiv.org Artificial Intelligence - Nov-27-2025

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Wang, Qineng, Huang, Wenlong, Zhou, Yu, Yin, Hang, Bao, Tianwei, Lyu, Jianwen, Liu, Weiyu, Zhang, Ruohan, Wu, Jiajun, Fei-Fei, Li, Li, Manling

Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition: affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.
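The sequence reordering format described above can be pictured with a small sketch: an ordered trajectory is shuffled into a question, and a predicted ordering is scored against the true one. The function names and both metrics (exact match and pairwise order accuracy) are assumptions chosen for illustration; the abstract does not state how ENACT scores answers.

```python
import random
from itertools import combinations


def make_reordering_item(ordered_items, seed=0):
    """Shuffle a trajectory and return (shuffled, answer).

    answer[k] is the index within `shuffled` of the k-th item in the
    true chronological order.
    """
    rng = random.Random(seed)
    order = list(range(len(ordered_items)))
    rng.shuffle(order)
    shuffled = [ordered_items[i] for i in order]
    answer = [order.index(k) for k in range(len(ordered_items))]
    return shuffled, answer


def exact_match(pred, answer):
    """True only if the whole predicted ordering is correct."""
    return list(pred) == list(answer)


def pairwise_accuracy(pred, answer):
    """Fraction of item pairs whose relative order matches the truth."""
    pred_rank = {item: r for r, item in enumerate(pred)}
    true_rank = {item: r for r, item in enumerate(answer)}
    pairs = list(combinations(answer, 2))
    if not pairs:
        return 1.0
    agree = sum(
        (pred_rank[a] < pred_rank[b]) == (true_rank[a] < true_rank[b])
        for a, b in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    states = ["door closed", "door open", "cup grasped", "cup on table"]
    shuffled, answer = make_reordering_item(states, seed=1)
    print(shuffled, answer)
    print(pairwise_accuracy(list(reversed(answer)), answer))
```

The same scaffold would serve both directions: shuffle observations for the forward task, or shuffle actions for the inverse task.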

arxiv preprint arxiv, large language model, machine learning, (19 more...)

2511.20937

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

arXiv.org Artificial Intelligence - Nov-20-2025

Computer-Use Agents as Judges for Generative User Interface

Lin, Kevin Qinghong, Hu, Siyuan, Li, Linjie, Yang, Zhengyuan, Wang, Lijuan, Torr, Philip, Shou, Mike Zheng

Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and usability--forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA act as judges to assist Coder in automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
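The designer/judge loop described above might be organized roughly as follows. This is a minimal sketch under stated assumptions: `coder` and `judge` are hypothetical callables standing in for the Coder model and the CUA, a "site" is any opaque value, and feedback is reduced to the list of failed tasks rather than the CUA Dashboard summaries mentioned in the abstract.

```python
def design_loop(coder, judge, tasks, max_rounds=3):
    """Alternate between generating a site and judging it on the tasks.

    coder(previous_site, failed_tasks) -> new site   (hypothetical interface)
    judge(site, task) -> bool, True if the task was completed (hypothetical)
    Returns the final site and the success rate recorded in each round.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    site = coder(None, [])  # initial design, no feedback yet
    history = []
    for _ in range(max_rounds):
        results = [(task, judge(site, task)) for task in tasks]
        success_rate = sum(ok for _, ok in results) / len(tasks)
        history.append(success_rate)
        failed = [task for task, ok in results if not ok]
        if not failed:
            break  # every task is solvable, stop revising
        site = coder(site, failed)  # revise using the judge's feedback
    return site, history


if __name__ == "__main__":
    # Toy stand-ins: a site is the set of features it supports.
    def toy_coder(previous_site, failed_tasks):
        return set(previous_site or ()) | set(failed_tasks)

    def toy_judge(site, task):
        return task in site

    print(design_loop(toy_coder, toy_judge, ["search", "checkout"]))
```

Passing the models in as callables keeps the loop itself independent of any particular Coder or CUA implementation.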

large language model, machine learning, natural language, (19 more...)

2511.15567

Genre: Research Report (0.82)

Technology:

Information Technology > Human Computer Interaction > Interfaces (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

arXiv.org Artificial Intelligence - Sep-30-2025

Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress

Mandikal, Priyanka, Hu, Jiaheng, Dass, Shivin, Majumder, Sagnik, Martín-Martín, Roberto, Grauman, Kristen

Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change--such as mashing, spreading, or slicing--where the object's physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. Project website: https://vision.cs.utexas.edu/projects/sparta-robot
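One way to read the "dense rewards that capture incremental progress" idea above is as the change in the fraction of the object that has moved from actionable to transformed between two segmentation maps. The sketch below assumes a label encoding (0 = background, 1 = actionable, 2 = transformed) and a nearest-region greedy rule; both are illustrative choices, not the formulation from the paper.

```python
BACKGROUND, ACTIONABLE, TRANSFORMED = 0, 1, 2  # assumed label encoding


def progress(seg_map):
    """Fraction of object cells that are already transformed."""
    actionable = sum(row.count(ACTIONABLE) for row in seg_map)
    transformed = sum(row.count(TRANSFORMED) for row in seg_map)
    total = actionable + transformed
    return transformed / total if total else 0.0


def dense_reward(prev_map, next_map):
    """Reward equals the progress gained between consecutive maps."""
    return progress(next_map) - progress(prev_map)


def nearest_actionable(seg_map, position):
    """Greedy target: the actionable cell closest to `position` (row, col).

    Returns None when nothing is left to transform.
    """
    cells = [
        (r, c)
        for r, row in enumerate(seg_map)
        for c, label in enumerate(row)
        if label == ACTIONABLE
    ]
    if not cells:
        return None
    pr, pc = position
    return min(cells, key=lambda rc: (rc[0] - pr) ** 2 + (rc[1] - pc) ** 2)


if __name__ == "__main__":
    before = [[1, 1], [1, 1]]
    after = [[2, 1], [1, 1]]
    print(dense_reward(before, after))
    print(nearest_actionable(after, (1, 1)))
```

Because the reward depends only on label counts, it is unaffected by the object's color or texture, which matches the abstract's point about stripping away appearance variability.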

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2509.24129

Country: North America (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

arXiv.org Artificial Intelligence - Sep-29-2025

MMPlanner: Zero-Shot Multimodal Procedural Planning with Chain-of-Thought Object State Reasoning

Tabassum, Afrina, Guo, Bin, Ma, Xiyao, Eldardiry, Hoda, Lourentzou, Ismini

Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation are largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning by +6.8%, cross-modal alignment by +11.9%, and visual step ordering by +26.7%.

artificial intelligence, large language model, natural language, (18 more...)

2509.21662

Country: North America > United States (0.93)

Genre:

Workflow (0.91)
Research Report (0.64)
Instructional Material > Training Manual (0.34)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence - Aug-5-2025

L3M+P: Lifelong Planning with Large Language Models

Agarwal, Krish, Jiang, Yuqian, Hu, Jiaheng, Liu, Bo, Stone, Peter

By combining classical planning methods with large language models (LLMs), recent research such as LLM+P has enabled agents to plan for general tasks given in natural language. However, scaling these methods to general-purpose service robots remains challenging: (1) classical planning algorithms generally require a detailed and consistent specification of the environment, which is not always readily available; and (2) existing frameworks mainly focus on isolated planning tasks, whereas robots are often meant to serve in long-term continuous deployments, and therefore must maintain a dynamic memory of the environment which can be updated with multi-modal inputs and extracted as planning knowledge for future tasks. To address these two issues, this paper introduces L3M+P (Lifelong LLM+P), a framework that uses an external knowledge graph as a representation of the world state. The graph can be updated from multiple sources of information, including sensory input and natural language interactions with humans. L3M+P enforces rules for the expected format of the absolute world state graph to maintain consistency between graph updates. At planning time, given a natural language description of a task, L3M+P retrieves context from the knowledge graph and generates a problem definition for classical planners. Evaluated on household robot simulators and on a real-world service robot, L3M+P achieves significant improvement over baseline methods both on accurately registering natural language state changes and on correctly generating plans, thanks to the knowledge graph retrieval and verification.
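To make the pipeline above more concrete, the sketch below keeps the world state as a set of (subject, relation, object) triples, applies a simple consistency rule on update, and renders the state as a PDDL problem for a classical planner. The triple encoding, the "functional relation" rule, and the predicate and domain names are hypothetical; the abstract does not specify the graph schema or the rules L3M+P enforces.

```python
def update_state(graph, subject, relation, obj, functional=("at",)):
    """Return a new graph with the triple added.

    For relations listed in `functional` (an object can hold only one
    value, e.g. one location), the stale triple is dropped first so the
    absolute world state stays consistent.
    """
    if relation in functional:
        graph = {t for t in graph if not (t[0] == subject and t[1] == relation)}
    return set(graph) | {(subject, relation, obj)}


def to_pddl_problem(name, domain, graph, goal):
    """Render the graph as a PDDL problem string.

    `goal` is a list of (subject, relation, object) triples that must hold.
    """
    objects = sorted({s for s, _, _ in graph} | {o for _, _, o in graph})
    init_facts = [f"({r} {s} {o})" for s, r, o in sorted(graph)]
    goal_facts = [f"({r} {s} {o})" for s, r, o in goal]
    lines = [
        f"(define (problem {name})",
        f"  (:domain {domain})",
        "  (:objects " + " ".join(objects) + ")",
        "  (:init " + " ".join(init_facts) + ")",
        "  (:goal (and " + " ".join(goal_facts) + ")))",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    state = set()
    state = update_state(state, "cup", "at", "table")
    state = update_state(state, "cup", "at", "sink")  # e.g. reported by a human
    print(to_pddl_problem("tidy-up", "household", state, [("cup", "at", "shelf")]))
```

In a fuller system the goal triples would come from an LLM's reading of the natural language task, and only the portion of the graph relevant to that task would be retrieved.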

artificial intelligence, large language model, natural language, (18 more...)

2508.01917

Country: North America > United States > Texas (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)